Theory of Mind (ToM) has received significant research attention. Traditional ToM 
research has provided important understanding of how humans reason about mental 
states by utilizing shared world knowledge, social cues, and the interpretation of 
actions; however, many current behavioral paradigms are limited to static, "third-person" 
protocols. Emerging experimental approaches such as cognitive simulation and simulated 
social interaction offer opportunities to investigate ToM in interactive, "first-person" and 
"second-person" scenarios while affording greater experimental control. The advantages 
and limitations of traditional and emerging ToM methodologies are discussed with the 
intent of advancing the understanding of ToM in socially mediated situations. 

Keywords: theory of mind (ToM), social perception, cognitive simulation, simulated social interaction, social 
cognition 



INTRODUCTION 

Every day, humans engage in a wide variety of social interactions 
to achieve a diverse set of social goals that include acquiring
information, influencing a partner's behavior, and maintaining
emotional intimacy through sharing thoughts and feelings. 
Integral to an individual's success in these social encounters is 
his or her ability to reason about the thoughts, beliefs, and feel- 
ings of others to predict behavioral responses. This ability has 
been termed theory of mind (ToM; Premack and Woodruff, 1978; 
Baron-Cohen et al., 1985). 

The social importance of ToM can be demonstrated by imagin- 
ing social interaction without it. To demonstrate the difficulty of 
explaining human behavior without ToM, Baron-Cohen (1997) 
used the task of explaining the behavior of a man who walks 
into a room, looks around, and then simply exits the room. 
Plausible mentalistic explanations can be easily generated for this 
scenario (e.g., the man forgot why he entered the room, the man 
was looking for something in the room and discovered it was 
not there); however, developing plausible non-mentalistic expla- 
nations of the man's behavior is more difficult (Baron-Cohen, 
1997). The difficulties in generating concise and probable non- 
mentalistic explanations for this set of simple behaviors illustrate, 
on a small scale, the confusion that might result from complex 
behaviors like deception, persuasion, or flirting in the absence of 
ToM. Indeed, recognition of ToM's importance for social func- 
tioning has sparked extensive research in clinical populations for 
whom challenges in social interaction are common, including 
individuals with autism spectrum disorders (Perner et al., 1989; 
Happé, 1994; Baron-Cohen et al., 1995, 1999, 2001; Hill, 2004; 
Losh et al., 2012), schizophrenia (Corcoran et al., 1995; Brüne 
et al., 2007; Champagne-Lavau and Stip, 2010; Couture et al., 
2011; Hooker et al., 2011), and traumatic brain injury (Bibby 
and McDonald, 2005; Havet-Thomassin et al., 2006; Henry et al., 
2006; Milders, 2006; Muller et al., 2010; Turkstra et al., 2004; 
Turkstra, 2008). 



Given the importance of ToM in daily interactions and the 
prevalence of ToM deficits in some clinical populations, it is 
important for researchers to critically consider both the concept 
of ToM as well as tasks used to investigate it. Accordingly, the 
goals of this article are to (1) present a framework of mecha- 
nisms that allow humans to infer and reason about mental states 
in social interaction, (2) review the benefits and limitations of 
current behavioral tasks designed to test each mechanism, and 
(3) discuss potential new directions for studying and under- 
standing ToM, with consideration of both the advantages and 
the limitations that these approaches offer over more traditional 
techniques. It is our position that the knowledge to be gained 
from the incorporation of these new methodologies may not only 
advance the understanding of how humans reason about the 
mental states of others, but may also further sciences devoted 
to improving or compensating for ToM impairments, as well as 
artificial intelligence research that is focused on developing 
artificial models of human-like social processes. 

MECHANISMS OF ToM 

Since ToM research has been broad in terms of the interested dis- 
ciplines, target populations and testing methodologies used, it is 
important to critically consider the concept of ToM, especially 
the ways in which our conceptualizations influence the course 
of ToM research. Developing a "working definition" of ToM will 
not only help guide research on the underlying network of skills 
that facilitate ToM, but may also provide insight into where 
breakdowns in ToM may occur. To begin forming such a definition, we 
posed the question, "How do humans accurately infer the men- 
tal states of others?" From this consideration, three components 
of interaction emerged as clues to ToM: (1) knowledge of the 
shared context, (2) perception of social cues, and (3) interpretations 
of actions (see Table 1 and Figure 1). These components, 
and the experimental tasks developed to assess each, are reviewed 
below. 

Table 1 | A summary of tasks used to test key mechanisms of Theory of Mind. 

Mechanism | Type of task | Example task | Findings
Shared world knowledge | Text-based tasks | Strange stories (Happé, 1994) | Individuals with autism have more trouble explaining the strange stories using mentalistic explanations than their peers without autism and those with mental disability (Happé, 1994; White et al., 2009).
Shared world knowledge | Non-verbal picture-based tasks | Character intention task (Sarfati et al., 1997) | Adults with TBI and schizophrenia are less accurate at choosing appropriate endings to comic strip stories where mental state attribution is needed (Sarfati et al., 1997; Havet-Thomassin et al., 2006).
Perceiving social cues | Facial emotion recognition | Reading the mind in the eyes task (Baron-Cohen et al., 2001) | Adults with TBI and autism have more trouble identifying mental states based on facial affect displays (Baron-Cohen et al., 2001; Havet-Thomassin et al., 2006; Turkstra, 2008).
Perceiving social cues | Facial/vocal emotion recognition | The awareness of social inference test (TASIT; McDonald et al., 2006) | Adults with TBI and schizophrenia are less accurate at identifying facial emotions than healthy, uninjured peers (McDonald et al., 2006; Sparks et al., 2010).
Interpreting actions | False belief tasks | Reality unknown false belief (Wimmer and Perner, 1983) | Typically developing children begin to pass reality unknown false belief tasks around the age of 4 years (Wimmer and Perner, 1983); however, children with autism may fail to pass this task (Baron-Cohen et al., 1985).
Interpreting actions | False belief tasks | Appearance reality false belief (Flavell et al., 1983) | Typically developing children begin to succeed on appearance reality tasks more frequently around the age of 4 years (Carlson et al., 2004).
Interpreting actions | False belief tasks | Second-order false belief (Perner and Wimmer, 1985) | Typically developing children develop some competence in second-order false belief tasks between the ages of 6 and 7 years.




SHARED WORLD KNOWLEDGE 

ToM, as one manifestation of cognition, is situated in the context 
of the surrounding world (Wilson, 2002). Consider, for exam- 
ple, the ToM demands implicit in a typical conversation. During 
a conversation, individuals must quickly infer their partners' 
thoughts, beliefs, emotions, and goals in order to formulate an 
appropriate response. As with other forms of joint action, making 
appropriate responses in conversation also requires the integra- 
tion of cues from the conversational partner and the context, 
including prior world knowledge (e.g., amount of personal space 
with which a partner might be comfortable), knowledge about 
the relationship between individuals (e.g., how much disclosure 
is appropriate with a close friend vs. a co-worker), the goal of 
the interaction (e.g., what information is required to complete 
a joint task), and the conditions under which the conversation 
will occur (e.g., in a group setting) to make quick, on-line guesses 
about a partner's mental states (for review see Sebanz et al., 2006; 
Knoblich et al., 2011). 

To investigate how shared world knowledge might facili- 
tate ToM, researchers commonly present participants with short 
descriptions or picture sets of social scenarios and then ask 
participants to infer the mental states of characters or to predict 
characters' behaviors based on these inferred mental states 
(Happé, 1994; Havet-Thomassin et al., 2006). Happé's Strange 
Stories Task (1994) is one example of this type of task. The Strange 
Stories Task tests the ability to use prior world knowledge in order 
to understand several communication acts embedded in story 
situations, including faux pas, persuasion, pretending, and decep- 
tion, and to accurately select the intended story interpretation. 

Tasks like the Strange Stories Task, designed to assess an ability 
to reason about mental states through integration of shared world 
knowledge, are valuable in that they have been found to detect 
ToM deficits both in individuals with autism spectrum disorders 
and in individuals with traumatic brain injury (Happé, 1994; Havet-Thomassin 
et al., 2006). However, it is important to consider that these tasks 
impose considerable cognitive demands while testing ToM, espe- 
cially demands on working memory and, in the case of verbal 
or text-based tasks, linguistic processing. For example, in inter- 
preting a story describing a scenario in which a character tells 
a white lie, participants must process the language of the story 
and hold the relevant information in working memory while that 
information is integrated with prior knowledge and an inter- 
pretation is selected. These demands become especially critical 
when investigating ToM of individuals who may have language 
disorders or deficits in working memory. Further, many of these 
tasks are passive and reflective in nature (e.g., Happé, 1994; 
Brüne et al., 2007), which may overestimate ToM ability in daily 
life. For example, tasks requiring individuals to make mental 
state inferences from described scenarios generally allow ample 
response time and do not typically require individuals to 
formulate appropriate responses as if they themselves were in the 
situation. 

FIGURE 1 | A storyboard of the Sally-Ann or Location False Belief Task (Wimmer and Perner, 1983) used to test attribution of mental states to others in children. Panel caption: "Sally goes for a walk." 

PERCEIVING SOCIAL CUES 

Another way in which humans infer the mental states of others 
is through the perception of various social cues. Though men- 
tal states are inherently cognitive phenomena, humans have a 
sophisticated repertoire of behaviors, including gaze cues, facial 
expressions, and vocal cues, through which they express their 
mental states. 

The perception of gaze cues as a means by which humans 
infer mental states is one of the most studied mechanisms of 
ToM. Gaze cues signal the basic direction or object of one's 
attention (Bayliss et al., 2007; Frischen et al., 2007), and by 
following a partner's gaze, an individual is able to infer his or 
her partner's intentions (Baron-Cohen et al., 1995). An individual's 
ability to use gaze-based cues to infer the knowledge of 
others develops in childhood, and these cues have been found 
to outweigh deceptive verbal cues in experimental paradigms 
(Freire et al., 2004). Additionally, gaze cues in conversation allow 
an individual to monitor understanding of his or her message 
(Clark and Krych, 2004) and also to signal a desire for partner 
feedback or to surrender the speaking floor (Kendon, 1967; 
Duncan, 1972; Bavelas et al., 2002). Further, speaker gaze cues 
may work to resolve linguistic ambiguities in non-literal language 
like sarcasm, as Williams et al. (2009) reported. For example, 
in Western cultures speakers tend to look away from their part- 
ners while making sarcastic comments, signaling that the speaker 
does not actually believe what he or she is saying (Williams et al., 
2009). 

Emotion recognition likewise has received much research 
attention because, like gaze cues, facial and vocal emotional cues are 
valuable in the inference of mental states (De Sonneville et al., 
2002). Typically, humans develop the ability to discriminate and 
perceive changes in facial expressions very early in life; however, 
the speed and accuracy with which children identify and match 
facial emotions continues to develop into adulthood (Barrera and 
Maurer, 1981; McClure, 2000). Both children and adults are gen- 
erally quicker and more accurate in their identification of positive 
emotions like "happy" as compared to negative emotions like 
"sad" (De Sonneville et al., 2002). Humans also convey emotional 
content in aspects of vocal production including vocal intensity, 
prosody, quality, and speech rate. 

Several tasks exist to study both gaze behavior and emo- 
tion recognition. Much has been learned about gaze behavior 
through the observation of dyadic interactions (e.g., Duncan, 
1972; Clark and Krych, 2004) as well as through experimen- 
tal manipulation of gaze cues (Baron-Cohen et al., 1995; Bayliss 
and Tipper, 2006). In a standard gaze perception task, indi- 
viduals are shown a face with the eyes either oriented straight 
ahead or shifted in one direction. From these images, participants 
are asked to make inferences about the characters' intentions or 
mental states (Frischen et al., 2007). Similar tasks have been devel- 
oped to assess comprehension of facial emotion (Ekman and 
Friesen, 1976). In one example, De Sonneville et al. (2002) pre- 
sented participants with four faces, each of which portrayed a 
different emotion, and asked participants to determine whether 
or not a target emotion was demonstrated in one of the four 
foils. These authors also used a matching emotion recogni- 
tion task, in which participants decided if two faces showed 
the same or different emotions (De Sonneville et al., 2002). 
Another method used to evaluate emotion recognition is to deter- 
mine how accurately participants identify emotions from facial 
expressions with varying levels of subtlety. Thomas et al. (2007) 
employed a task in which participants viewed photographs of 
people portraying different degrees of various emotions, from 
very subtle anger to very obvious happiness, to detect the accu- 
racy of participant emotion recognition. Vocal emotion recogni- 
tion can also be evaluated with tasks similar to those designed 
to measure facial emotion. In these tasks, participants gener- 
ally hear semantically neutral sentences with different forms of 
emotional prosody and are asked to identify the emotion of 
the speaker (Nowicki and Carton, 1993; Scherer and Scherer, 
2011). 

Like tasks requiring the use of shared world knowledge for 
ToM reasoning, social cue perception tasks have greatly con- 
tributed to what is understood about mental state reasoning. 
These tasks too, however, share limitations in their reflective, 
offline design and limited ecological validity. Considering 
emotion recognition tasks, for example, participants are often 
presented with decontextualized images of faces (e.g., Bowers 
et al., 1999; Baron-Cohen et al., 2001) and are asked to either 
identify the emotion or to match it to a target (De Sonneville et al., 
2002; Thomas et al., 2007). Additionally, even when stimuli are 
dynamic in the form of video clips (McDonald et al., 2006), par- 
ticipants are still given time to observe the stimulus, consider its 
properties, and make a judgment. In daily life, emotional displays 
are fleeting but are rarely presented in isolation; redundant clues 
to mental states are presented in partners' words, faces, voices, 
and actions. This combination of presenting isolated social cues, 
which may underestimate actual abilities, and prolonged observation 
and thinking time, which may overestimate abilities, makes 
it difficult to establish an accurate picture of the perception of 
social cues in everyday interaction. These limitations are of clin- 
ical importance because the ability to infer mental states from 
social cues has been commonly studied as a means to better 
understand the impact of social deficits on functioning in every- 
day life (Spell and Frank, 2000; Baron-Cohen et al., 2001; Croker 
and McDonald, 2005; Tonks et al., 2007; Turkstra, 2008; Zupan 
et al., 2009). 

INTERPRETATION OF ACTIONS 

Research on the development of ToM has provided evidence that 
children as young as 6 months of age form expectations regarding 
how humans interact with other humans and inanimate objects 
(Legerstee et al., 2000). As humans, we generally believe that oth- 
ers act in ways that are consistent with their beliefs and goals 
(Heider and Simmel, 1944; Ajzen, 1991). Given this assumption, 
passively observing behavior can offer important clues regarding 
the intentions or beliefs of others. Several tasks have been devel- 
oped to evaluate participants' abilities to infer mental states from 
behavior (Wimmer and Perner, 1983). For example, in a standard 
false-belief task (e.g., Wimmer and Perner, 1983), participants 
infer a character's belief based on observations of her actions. 

Tasks requiring the interpretation of actions are frequently 
used in the developmental literature (Baron-Cohen et al., 1985; Luo, 
2011; Scott et al., 2012), but studies of joint action have also 
provided insight into how humans interpret actions to infer 
the mental states of others (for review see Sebanz et al., 2006; 
Knoblich et al., 2011). For example, Sebanz et al. (2006) sug- 
gest that using gaze cues to infer what someone is attending to, 
as well as knowing what task the person is engaged in, helps 
observers to predict others' action goals. Supporting the rela- 
tionship between observing actions and inferring mental states, 
Ramnani and Miall (2004) trained participants to perform a 
button-press task in response to visual symbols, with the symbol 
color indicating whether the participant, a partner, or the com- 
puter should respond. Neuroimaging data from this experiment 
suggested that predicting another's actions (i.e., predicting when 
a partner should act) activated neural regions important for ToM 
(Ramnani and Miall, 2004). 

While tasks designed to test ToM through the observation of 
actions are inherently passive in nature, joint action tasks, like that 
used by Ramnani and Miall (2004), have allowed for the study of ToM 
abilities in simulated interactions as opposed to simply reasoning 
about social scenarios as third-party observers. 

EMERGING TOOLS AND METHODS TO STUDY ToM 

The mechanisms discussed above allow individuals to draw on 
information about the actions, behaviors, and knowledge of oth- 
ers to make inferences about their thoughts, beliefs, feelings, and 
intentions. Individuals gather this information through recipro- 
cal interactions and process it on-line to make ToM inferences 
and determine subsequent behavior. To study this interactive, 
on-line social-cognitive process, research on ToM has primarily 
used experimental paradigms that involve participants making 
ToM inferences from stimuli presented as static images 
(Baron-Cohen et al., 2001; De Sonneville et al., 2002; Mutlu et al., 
2009), textual stories (Happé et al., 1998), or video vignettes 
(Turkstra, 2008) that provide an observation-based, reflective 
"third-person" understanding of ToM (Frith and Frith, 2006; 
Schilbach et al., 2012). Schilbach et al. (2012) argue that ToM, 
and social cognition in general, has fundamentally different 
motivational consequences and underlying neural processes when 
individuals are socially and emotionally engaged with others than 
when they are third-person observers. Interacting with others 
provides individuals with the ability to perform active conver- 
sational roles, which might include initiating or responding to 
episodes of interaction, rather than observing the interaction as 
a bystander. This active involvement also facilitates shared goals, 
intentions, and actions among the participants of the interac- 
tion, providing individuals with the ability to draw on firsthand 
experience in making ToM inferences. 

Recent research has highlighted the limitations of experimen- 
tal methods that provide "third-person" evaluations of social 
phenomena and has proposed a "second-person" approach to 
studying social cognition (Wilms et al., 2010; Duff et al., 2012; 
Risko et al., 2012; Schilbach et al., 2012). Risko et al. (2012) 
suggested that experimental paradigms designed to study social 
cognition form a continuum between simple, static representations 
of socially relevant stimuli and actual live social interaction, 
that is, between "reel" and "real" instances of interaction. This 
continuum includes static schematic faces, dynamic schematic 
faces, static photographs of faces, static photographs of people 
in complex social scenes, dynamic images of people in com- 
plex social scenes, situations with the potential for real social 
interaction, and real social interactions. Studies that compare 
responses to stimuli from different sections of this continuum 
show significant differences. For instance, imaging studies show 
that direct gaze elicits significantly greater brain response than 
either gaze aversion or no gaze, but only when participants 
observe live stimuli and not when they observe static images 
(Pönkänen et al., 2010). Similarly, while the propensity to look 
toward another individual's eyes exists across the spectrum from 
schematic faces to dynamic social scenes, potential for actual 
social interaction significantly affects this propensity (Risko et al., 
2012). These results highlight key limitations in the use of exist- 
ing "third-person" paradigms in the study of ToM mechanisms 
and motivate the use of alternative approaches that afford study- 
ing "first-person" or "second-person" social cognition in "real" 
experimental paradigms. 
FIGURE 2 | Illustrations of three simulated experimental paradigms: (1) immersive virtual environments, (2) physical environment with a virtually 
simulated character, and (3) physical environment with a humanlike robot. 



FIRST- AND SECOND-PERSON APPROACHES TO STUDYING ToM 

An emerging approach that seeks to build a first- or second- 
person understanding of ToM mechanisms and processes is the 
use of simulation-based computational methods such as simulated 
social interaction (Blascovich et al., 2002) and cognitive simula- 
tion (Scassellati, 2002). These methods draw on advancements 
in computer sciences to employ complex computational systems 
that enable the simulation of embodied, situated interactions 
and thus the development of protocols with great experimental 
control and ecological validity. 

Simulated social interaction 

Simulated social interaction involves generating social behavior 
in artificial agents such as humanlike robots or virtual characters, 
the latter often embedded in immersive virtual environments. 
Simulated social interaction offers greater experimental 
control and ecological validity than do traditional "reel" exper- 
imental paradigms (Loomis et al., 1999; Blascovich et al., 2002; 
MacDorman and Ishiguro, 2006; Mutlu et al., 2009; Wilms et al., 
2010). In these experimental paradigms, participants interact 
with simulated others whose behaviors are precisely controlled to 
reflect experimental manipulations and who respond to changes 
in the participants' behaviors, affording interactions that more 
closely resemble real-world interactions than static stimuli do. 
These interactions might take place in immersive virtual environments, 
in the physical environment with a virtually simulated 
character (Pelphrey and Carter, 2008; Wilms et al., 2010), or in 
the physical environment with a humanlike robot (MacDorman 
and Ishiguro, 2006; Mutlu et al., 2009) (Figure 2). 

Simulated social interaction offers a number of advantages 
over traditional experimental paradigms used for the study of 
ToM, including precise control of experimental stimuli; interactive, 
dynamic social interaction; and on-line processing and measurement. 
Thus, these methods more closely approximate the 
ToM demands of everyday interactions. Simulations of social 
stimuli follow computational representations of human behav- 
ior, which provide the experimenter with control parameters for 
the behavior or mechanism under study and the ability to create 
experimental manipulations that are impossible or infeasible for 
human confederates to perform. In a recent study, Andrist et al. 
(2012) developed a parametric model of gaze shifts and precisely 
manipulated how much the head of a virtual character aligned 
with its gaze target, creating two kinds of gaze shifts: affiliative 
and referential. Affiliative gaze shifts involved the virtual character 
aligning the head more toward the participant when it shifted its 
gaze in the environment, while referential gaze shifts involved the 
character aligning its head more with its gaze target (Figure 3). 
The results showed that affiliative gaze increased subjective 
evaluations of the character and the interaction, while referential gaze 
increased recall of information from the environment (Andrist 
et al., 2012). 

FIGURE 3 | The two types of gaze behaviors designed as experimental stimuli: referential (left) and affiliative (right) (Andrist et al., 2012). 
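
The idea of a single control parameter for head alignment can be illustrated with a minimal sketch. This is not the model of Andrist et al. (2012); it only assumes that a gaze shift of a given angle can be divided between head and eyes by one hypothetical parameter, here called `head_alignment`, with the participant located at 0 degrees. The function name, the parameter values, and the neglect of eye motor limits are all illustrative assumptions.

```python
from dataclasses import dataclass


@dataclass
class GazeShift:
    head_yaw_deg: float  # part of the shift carried by the head
    eye_yaw_deg: float   # remaining part carried by the eyes


def plan_gaze_shift(target_yaw_deg: float, head_alignment: float) -> GazeShift:
    """Split a gaze shift toward a target between head and eyes.

    head_alignment is a hypothetical parameter in [0, 1]:
    0 keeps the head facing the participant (the eyes do all the work),
    1 turns the head fully toward the gaze target.
    Eye motor limits are ignored in this sketch.
    """
    if not 0.0 <= head_alignment <= 1.0:
        raise ValueError("head_alignment must be between 0 and 1")
    head = head_alignment * target_yaw_deg
    return GazeShift(head_yaw_deg=head, eye_yaw_deg=target_yaw_deg - head)


# Illustrative settings only; these are not values reported in the study.
AFFILIATIVE_ALIGNMENT = 0.25  # head stays mostly oriented to the participant
REFERENTIAL_ALIGNMENT = 0.75  # head turns mostly toward the gaze target

affiliative = plan_gaze_shift(40.0, AFFILIATIVE_ALIGNMENT)
referential = plan_gaze_shift(40.0, REFERENTIAL_ALIGNMENT)
print(affiliative)
print(referential)
```

Exposing the manipulation as one number is what makes the stimulus reproducible across trials in a way that would be difficult for a human confederate.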

The experimental paradigms involved in simulated social situations 
go beyond "presenting social stimuli" and also offer 
reciprocal social interaction. For instance, Wilms et al. (2010) 
developed an experimental paradigm in which a virtual character 
shifted its gaze jointly with the participant using real-time 
eye-tracking input. Their results showed that establishing joint 
attention to an object of interest elicited greater activity in the 
brain (medial prefrontal cortex and posterior cingulate cortex) 
than did attending to the object non-jointly. The simulation 
approach also affords on-line processing and measurement of 
social interaction. For instance, another imaging study showed 
that a virtual character elicited greater brain activation in the 
superior temporal sulcus when it established mutual gaze with 
participants than it did when it averted its gaze as it passed by 
participants in the virtual world (Pelphrey et al., 2004), allowing 
researchers to measure an on-line neural response that could not be 
captured by reflective approaches. 

The precise control, interactivity, and on-line processing 
afforded by this experimental approach offer greater ecological 
validity for the study of ToM and social cognition. In a study that 
embodies these characteristics, Mutlu et al. (2009) explored how 
leakage cues (non-verbal cues through which individuals reveal their 
thoughts, beliefs, and intentions) presented by a robot might 
elicit ToM inferences in participants. Participants played a ver- 
sion of the 20-Questions guessing game with a humanlike robot 
in which the robot covertly picked one of the items that were laid 
on a table and the participants tried to guess which item the robot 
picked by asking questions that the robot could only answer with 
"Yes" and "No" (Figure 4). In half of the trials, the robot pro- 
duced a brief gaze shift toward the item before answering the 
questions, producing a leakage cue, while it did not shift its gaze 
in the other half. Participants were able to identify the robot's pick 
with fewer questions and in shorter time when the robot pro- 
duced leakage cues than they did when the robot did not leak 
information, suggesting that the participants used the socially relevant 
information to make ToM inferences and to more effectively 
narrow down the response options. Furthermore, the majority 
of participants did not report noticing leakage cues or using this 
information in the task, suggesting an implicit processing of such 
cues. This experimental paradigm offers the ability to precisely 
control the gaze cues presented by the robot, present these stimuli 
in an interactive, dynamic protocol, and support on-line pro- 
cessing toward shaping subsequent behavior in the interaction. 
The experiment also contextualized ToM processes in a simu- 
lated interaction that closely resembled face-to-face interaction 
and captured effects of ToM processes on objective measures of 
task performance. 

Cognitive simulation 

While simulated social interaction offers the ability to generate 
precisely controlled social behaviors in artificial agents and to 
create dynamic, interactive experimental scenarios, this approach 
relies primarily on pre-scripted and experimenter-controlled 
protocols, often referred to as Wizard-of-Oz studies (Dahlbäck 
et al., 1993). This approach does not afford the study of ToM 
processes in complex interactions such as joint action scenar- 
ios (Sebanz et al., 2006) that require all parties to engage ToM 
mechanisms in coordinating their actions toward a common 
goal. A complementary approach to simulated social interac- 
tion is cognitive simulation, which seeks to develop artificial 
representations of neurocognitive mechanisms such as imitation 
and perception of self, simulate them in artificial agents such as 
humanlike robots, and assess their functioning in enabling ToM 
inferences in human-agent interactions (Breazeal and Scassellati, 
2002; Scassellati, 2002; Michel et al., 2004). 

Building on simulation theory (Gallese and Goldman, 1998), 
cognitive simulation involves the robot establishing and main- 
taining representations of the mental states of its human counter- 
parts by tracking and matching their states with resonant states of 
its own. These representations enable the robot to take the per- 
spective of its human counterparts, make inferences about the 
human's goals, and learn from their actions. For instance, Trafton 
et al. (2005) developed a cognitive simulation that enabled the 
robot to simulate "alternative worlds" and assess propositions 
about these worlds in order to make inferences about the perspec- 
tive of its human counterparts. A similar approach by Breazeal 
et al. (2006) involved the robot maintaining separate sets of 
"beliefs" in its belief system for itself and for its human coun- 
terparts. The separate sets of beliefs enabled the robot to identify 
differences in its beliefs from those of its human counterparts in 
order to plan actions that it might take or skills it might learn in 
order to establish a shared set of beliefs. 
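
The notion of keeping separate belief sets for oneself and for a partner can be sketched in a few lines. The sketch below is not the architecture of Breazeal et al. (2006); it assumes, for illustration only, that beliefs are simple proposition-value pairs and that an agent's beliefs change only when that agent witnesses an event. The class name `BeliefTracker`, its methods, and the example propositions are hypothetical.

```python
from typing import Dict, Iterable, Optional, Tuple


class BeliefTracker:
    """Keeps one belief set per agent (illustrative sketch only)."""

    def __init__(self, agents: Iterable[str]) -> None:
        self.beliefs: Dict[str, Dict[str, str]] = {agent: {} for agent in agents}

    def observe(self, proposition: str, value: str, witnesses: Iterable[str]) -> None:
        # Only agents who witness an event update their beliefs about it.
        for agent in witnesses:
            self.beliefs[agent][proposition] = value

    def differences(
        self, a: str, b: str
    ) -> Dict[str, Tuple[Optional[str], Optional[str]]]:
        # Propositions on which the two agents' belief sets disagree.
        own, other = self.beliefs[a], self.beliefs[b]
        return {
            p: (own.get(p), other.get(p))
            for p in sorted(set(own) | set(other))
            if own.get(p) != other.get(p)
        }


tracker = BeliefTracker(["robot", "human"])
# Both agents see the toy placed in the basket.
tracker.observe("toy_location", "basket", witnesses=["robot", "human"])
# Only the robot sees the toy moved to the box.
tracker.observe("toy_location", "box", witnesses=["robot"])
print(tracker.differences("robot", "human"))
# prints {'toy_location': ('box', 'basket')}
```

The example mirrors the structure of a location false-belief task: the mismatch returned by `differences` marks where the partner holds an outdated belief, which is the kind of discrepancy an agent could act on to re-establish shared beliefs.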

Examples of the cognitive simulation approach also include 
simulations of the motor-resonance mechanism (Blakemore and 
Decety, 2001) for enabling ToM inferences in artificial agents. 
These examples build on the finding that observing the actions 
of others elicits automatic activation of motor representations 
associated with these actions and enables predictions about 
action consequences. For instance, Gray et al. (2005) developed 
a system in which the robot parsed the actions of its 
human counterparts and matched them to actions in its own 
repertoire in order to make inferences about the goals of the 
participant. These inferences enabled the robot to perform
task-level simulations and track the participant's progress in the
task in order to anticipate the needs of its partner and offer
help. A model developed by Bicho et al. (2011) extended this
paradigm by including a mapping between observed actions
and complementary actions required to successfully complete
a task and enabled the robot to more effectively coordinate
its actions with those of its human counterparts in the
task.

Protocol with stylized humanlike robot

FIGURE 4 | Participants interacting with two humanlike robots in an
interactive protocol designed to study how individuals might use
nonverbal leakage to make ToM inferences (Mutlu et al., 2009).
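The pairing of observed actions with complementary actions can be made concrete with a small Python sketch. It does not reproduce the systems of Gray et al. (2005) or Bicho et al. (2011); a plain lookup table stands in for whatever models the cited work used, and the task plan, action names, and function names are hypothetical. The robot matches observed human actions against the expected steps of a task, tracks how far the task has progressed, and selects the action that complements the most recently completed step.

```python
from typing import List, Optional

# Hypothetical task plan: each step pairs the action expected from the human
# with the complementary action the robot should take in response.
TASK_PLAN = [
    ("pick_up_wheel", "hand_over_nut"),
    ("attach_wheel", "hold_base_steady"),
    ("tighten_nut", "fetch_next_part"),
]


def track_progress(observed: List[str]) -> int:
    """Count how many plan steps the observed actions have completed, in order.
    Actions that do not match the next expected step are ignored."""
    step = 0
    for action in observed:
        if step < len(TASK_PLAN) and action == TASK_PLAN[step][0]:
            step += 1
    return step


def complementary_action(observed: List[str]) -> Optional[str]:
    """Return the robot action that complements the most recently completed
    step, or None if no step of the plan has been completed yet."""
    step = track_progress(observed)
    return TASK_PLAN[step - 1][1] if step > 0 else None


print(complementary_action(["pick_up_wheel"]))
print(complementary_action(["pick_up_wheel", "scratch_head", "attach_wheel"]))
```

Ignoring unmatched actions is a simplifying design choice; a fuller model would treat unexpected actions as evidence about the partner's goals rather than discarding them.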

The preceding examples illustrate how cognitive simulation
might complement the simulated interaction approach for study- 
ing ToM mechanisms by simulating ToM processes in artificial 
agents. When coupled, the two approaches promise two key 
methodological advances in the study of ToM. First, they help in 
assessing existing neurocognitive models of ToM mechanisms by 
computationally simulating them and observing system behav- 
ior in interactive situations. Second, they enable empirical studies 
to build new understanding of ToM processes in truly interac- 
tive protocols in which all agents — human or artificial — involved 
in the interaction employ ToM mechanisms. The coupling of 
the two approaches extends the methodological advantages of 
simulated social interaction by enabling not only on-line mea- 
surement of responses to social stimuli but also on-the-fly pre- 
cise control over simulated cognitive mechanisms and social 
behaviors, thus affording even greater experimental control. The 
truly interactive setting of the protocols enabled by the coupling
of these two approaches also improves their ecological
validity. 

EXAMPLE PARADIGMS FOR STUDYING ToM MECHANISMS 

Recent research across many fields of social, cognitive, and com- 
putational sciences has developed first- and second-person exper- 
imental methods to study ToM mechanisms with the goal of 
gaining a better understanding of these mechanisms and design- 
ing artificial agents that effectively interact with people. The 
paragraphs below illustrate paradigms that study the three key 
ToM mechanisms described earlier. 

Shared world knowledge 

In making ToM inferences, individuals draw on shared world
knowledge to integrate information from various sources includ- 
ing resources in the environment, knowledge about social norms, 
the goals of the interaction, the relationship among interac- 
tion partners, and the participation structure of the interac- 
tion. Mumm and Mutlu (2011) investigated how the relation- 
ship between a humanlike robot and participants affected the 
participants' preferences for interpersonal distance, creating a 
"likable" or "unlikable" humanlike robot using verbal fram- 
ing. Participants responded to the unlikable robot's attempts at 
increasing intimacy using mutual gaze by physically distancing 
themselves from the robot (Figure 5), while they did not change 
their proxemic behavior with a likable robot. The results suggest 
that the experimental protocol successfully established differ- 
ent relationships between the robot and the participant across 
the two conditions, which in turn shaped their preferences for 
interpersonal distance. The protocol also enabled on-line
processing of gaze stimuli and measurement of preferences for
interpersonal distance directly from the distance that the
participants maintained with the robot.

Gaze follow / Gaze avoid

FIGURE 5 | Participants maintaining different amounts of distance
between themselves and an "unlikable" robot in response to the
robot's attempts to increase intimacy by following or avoiding the
participants with its gaze (Mumm and Mutlu, 2011).

Another study by Mutlu et al. (2009) explored how a robot 
might use gaze cues to signal participation structure in a three- 
party conversation and how this structure shaped participants' 
conversational behavior. There were three conditions, which differed
according to the percentage of time the robot gazed at each
of the two participants in that condition. In one condition, 
the robot looked exclusively at one participant (the addressee), 
signaling that the second participant had the role of over- 
hearer. In the second condition, the robot looked mostly at the 
addressee and occasionally at the second participant, indicating 
a role of bystander. In the third condition, gaze was distributed 
equally between the two participants, indicating that both were 
addressees. The results from the study showed that the partic- 
ipants conformed to the norms of their signaled roles in their 
conversational behavior 97% of the time and rated their experi- 
ence with the conversation and involvement in the group con- 
sistently with their conversational roles (e.g., feeling excluded 
in the overhearer condition). The simulated social interaction 
approach enabled precise control of the robot's gaze behav- 
iors to signal specific participation structures and illustrated 
how individuals integrate a perception of the robot's behav- 
ioral cues and their world knowledge, particularly the norms 
associated with the participation structure of a conversation, 
to make ToM inferences about the conversational intentions 
of the robot and follow the norms associated with inferred 
intentions. 
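The three gaze conditions can be summarized as a rule that maps each participant's share of the robot's gaze to a signaled conversational role. The Python sketch below is illustrative only: the function name, the tolerance parameter, and the numeric gaze shares in the examples are assumptions, since the description above specifies only "exclusively", "mostly", and "equally".

```python
from typing import Dict


def signaled_roles(gaze_share: Dict[str, float], tolerance: float = 0.05) -> Dict[str, str]:
    """Map two participants' shares of the robot's gaze to signaled roles.

    Roughly equal shares       -> both are addressees.
    One participant gets none  -> that participant is an overhearer.
    One participant gets less  -> that participant is a bystander.
    """
    if len(gaze_share) != 2:
        raise ValueError("expected exactly two participants")
    (a, share_a), (b, share_b) = gaze_share.items()
    if abs(share_a - share_b) <= tolerance:
        return {a: "addressee", b: "addressee"}
    main, other = (a, b) if share_a > share_b else (b, a)
    other_role = "overhearer" if gaze_share[other] == 0 else "bystander"
    return {main: "addressee", other: other_role}


# Hypothetical gaze distributions for the three conditions.
print(signaled_roles({"P1": 1.0, "P2": 0.0}))
print(signaled_roles({"P1": 0.8, "P2": 0.2}))
print(signaled_roles({"P1": 0.5, "P2": 0.5}))
```

Because the robot's gaze is fully scripted, a rule of this kind is all that is needed to hold the signaled participation structure constant within a condition, which is the experimental control that the simulated social interaction approach provides.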

Perception of social cues 

ToM inferences are also informed by perceptions of social cues 
such as gaze. The study by Mutlu et al. (2009) illustrates how ToM 
might be studied using simulated social interaction. Examples 
of simulation-based protocols also include studies that explore 
how the precise temporal and spatial congruency of such cues 
might affect outcomes such as joint attention, information recall, 
and task performance (Staudte and Crocker, 2011; Huang and 
Mutlu, 2012). Staudte and Crocker (2009) developed a video-based
experimental paradigm in which a robot presented factual
statements about objects in the environment, such as "the sphere
is next to a pyramid," and looked toward the referenced objects
(e.g., sphere and pyramid) 800-1000 milliseconds before the object
names were spoken. The study manipulated the congruency between
the robot's linguistic and gaze references and showed that
participants confirmed the correctness of the statements faster
when the two references were congruent. The simulation-based
approach not only ensured that gaze cues were presented with
precise timings but also enabled the presentation of incongruent
gaze and speech cues, which human confederates find difficult to
produce reliably in an interactive protocol.

Another study by Huang and Mutlu (2012) extended these
results by comparing congruent gaze and speech cues with
temporally incongruent cues, created by delaying the robot's
gaze shifts so that they were produced 500-1000 milliseconds
after the onset of linguistic references. The study also contex- 
tualized these comparisons in two realistic tasks. The first task 
involved the robot recounting a story and referring — using lin- 
guistic and gaze cues — to a set of props. In the second task, 
the robot provided instructions to participants to complete a 
sorting task, referring to objects to be sorted using linguistic 
and gaze cues (Figure 6). The results from the study showed 
that participants recalled the details of the story better in the
first task, and were faster at locating to-be-sorted objects in the
second task, when the robot used congruent speech and gaze cues
than when cues were spatially or temporally incongruent.
In addition to extending the results obtained by Staudte
and Crocker (2009) to a physically situated paradigm, this study 
demonstrated the role of temporal congruency in ToM processes.
The study also illustrated the effects of perception of
social cues on the outcomes of ToM processes such as information
recall and task performance in joint action. When the
robot used congruent gaze cues, participants established stronger
associations between objects in the environment and verbal
information presented by the robot, thereby showing improved task
performance.

Sorting task

FIGURE 6 | The robot is using gaze cues to establish joint attention
with the participants in a storytelling task (top) and a sorting task
(bottom) (Huang and Mutlu, 2012).

Interpretation of actions 

Mechanisms such as action observation (Knoblich et al., 2011) 
also facilitate ToM processes by establishing a "procedural 
common ground" between interaction partners (Clark, 1996). 
Paradigms developed to study this mechanism most commonly 
follow the cognitive simulation approach (e.g., Trafton et al., 
2005; Breazeal et al., 2006) or employ a coupling of the simulated 
social interaction and cognitive simulation approaches (e.g., Gray et al., 2005;
Bicho et al., 2011). A recent study used a paradigm in which a 
robot provided participants with instructions on how to assemble 
a structure and, by observing the participant's actions, main- 
tained an internal model of the participants' progress, specifi- 
cally modeling potential breakdowns in the participant's under- 
standing or execution of the instructions (Mutlu et al., 2013) 
(Figure 7). When the robot inferred misunderstandings or confu- 
sion from its observations of participant actions or lack thereof, it 
offered clarifications on the actions that the participant must take 
to successfully progress in the task using conversational repair 
strategies (Clark, 1994). Compared to the robot only provid- 
ing instructions, the robot that made ToM inferences through 
action observation and offered clarifications based on these infer- 
ences received fewer requests for help from the participants when 
breakdowns occurred. This protocol illustrates how the cog- 
nitive simulation and simulated social interaction approaches 
might be combined to create truly interactive experimental 
paradigms in which participants and artificial agents employ ToM 
mechanisms. Cognitive simulation enabled the robot to use its 
observations of participant actions to infer misunderstandings 
and confusions and to plan appropriate actions to offer
clarifications. The simulated social interaction approach established
communication between the participant and the robot and
enabled the robot to use conversational strategies to execute its
plan to offer clarifications.

FIGURE 7 | The robot combines cognitive simulation and simulated
social behaviors to maintain a model of the participant's progress in
the task using action observation and to provide feedback using
conversational repair (Mutlu et al., 2013).

LIMITATIONS OF COMPUTATIONAL PARADIGMS 

Simulation-based approaches offer unprecedented methodological
advantages for studying social cognition and ToM processes in
particular. They offer greater experimental control, on-line pro- 
cessing of stimuli, a richer set of measures for ToM outcomes, 
and truly interactive experimental scenarios, enabling the study 
of complex interactions and affording more ecologically valid 
research. These approaches, however, build on a number of 
assumptions and are affected by a number of factors that limit 
their promise. 

Differences in simulated and real interactions 

A key assumption that the simulation-based approaches dis- 
cussed above make is that simulations match "real" stimuli and 
that interactions with artificial agents are equivalent to inter- 
actions with humans. While theoretical accounts such as the 
mindlessness hypothesis (Langer, 1992; Nass and Moon, 2000) 
and the findings presented above suggest that simulated interac- 
tions closely approximate human interactions, a critical exami- 
nation of the differences between these forms of interactions is 
essential. 

Design of computational platforms 

Another factor that might limit the generalizability of findings 
obtained in simulated approaches is the design of the simula- 
tion platform. For instance, the experimental protocol developed 
by Mutlu et al. (2009) to study leakage gaze cues employed two 
robot platforms to investigate whether the design of the specific 
robot platform used in the study affected ToM inferences. The
results showed that leakage gaze cues elicited ToM inferences, thus 
improving participant performance in guessing the robot's pick, 
when presented by a highly humanlike android but not when 
presented by a robot with a stylized design that featured cartoon- 
like features (although a trend in the data toward an effect was 
present). This finding suggests that the physical design of the 
robot platform might affect social cognition and ToM inferences 
with artificial agents and that different designs might vary in the 
extent to which they activate ToM mechanisms. While the choice 
of experimental materials and the design of experimental stimuli
have been found to affect social cognition in conventional
experimental paradigms, as when the attractiveness of faces shown in
stimuli interacts with gaze manipulations (Kampe et al., 2001),
a better understanding of how design features of artificial agents
affect social cognition, together with the development of benchmark
platforms for research in social cognition, might help isolate such
effects.

Simplifications in simulations 

Simulations of cognitive and behavioral mechanisms necessar- 
ily involve simplifications in representation and presentation due 
to difficulties in capturing the complexity of these mechanisms. 
Such simplifications might result in artificial agents displaying
"robot-like" behaviors that fail to activate ToM mechanisms
or that elicit feelings of eeriness or discomfort, an effect often
referred to as the "uncanny valley" (Mori, 1970). Recent findings
indicate that exposure to artificial agents results in a shift in
what participants consider to be acceptable or preferred human 
qualities (Chen et al., 2010), suggesting that properly acclimat- 
ing participants to simulations might alleviate such effects or that 
people might attune to simplified behaviors as artificial agents 
become commonplace. 

Individual differences 

A final consideration in the use of simulation-based experimen- 
tal paradigms is individual differences in responses to artificial 
agents. Studies to date have identified individual differences based 
on participant sex (Mutlu et al., 2006), cultural background 
(Nomura et al., 2008), personality (Walters et al., 2008), and pet
ownership (Mutlu et al., 2009). While such differences might also
be observed in conventional experimental paradigms, whether 
these differences predict those in simulation-based paradigms is 
unknown. Simulation-based studies of ToM processes must con- 
sider demographic and attitudinal factors to establish a better 
understanding of their effects. 

SUMMARY AND CONCLUSIONS 

The study of ToM has provided much knowledge regarding 
how the ability to reason about mental states develops in typ- 
ical children and insight into impaired ToM in clinical pop- 
ulations. The tasks developed through this research largely 
measure individuals' abilities to utilize and integrate informa- 
tion from shared world knowledge, social cues, and physical 
actions to infer the mental states of others to predict future 
behavior. While the passive and reflective methods tradition- 
ally used to study ToM have been invaluable to understand- 
ing the mechanisms associated with the ability to infer mental 
states, we, with others (Risko et al., 2012), argue that investi- 
gations using interactive, on-line approaches to study ToM and 
social cognition hold great promise to enrich the established 
knowledge base. As described, emerging methods, including the 
use of virtual and robotic platforms, provide opportunities to 
study human behavior in social interactions while maintain- 
ing experimental control. These emerging interactive and com- 
putational methods may, in addition to extending knowledge 
regarding ToM processing, enable the development of innovative, 
technology-driven protocols for improving mentalizing in cases 
of disorder.